Abstract

We consider the problem of succinctly encoding a static map to support approximate queries.
We derive upper and lower bounds on the space requirements in terms of the error rate and the
entropy of the distribution of values over keys: our bounds differ by a factor of log e.

For the upper bound we introduce a novel data structure, the Bloom map, generalising the 
Bloom filter to this problem. The lower bound follows from an information theoretic argument. 

1 Introduction 

The ability to query a map to retrieve a value given a key is fundamental in computer science. As
the universe from which keys are drawn grows in size, information theoretic lower bounds imply 
that any data structure supporting error-free queries of a map requires unbounded space per key. 
However, if we are willing to accept errors, constant space per key is sufficient. 

For example, in information retrieval we may wish to query the frequencies (values) of word 
sequences (keys) in documents. A priori these sequences are drawn from a universe that is expo- 
nential in the length of a sequence. Returning an incorrect value for a small proportion of queries 
may be acceptable, if this enables us to support queries over a far larger data set. 

Consider a map consisting of $n$ key/value pairs $M = \{(x_1, v(x_1)), (x_2, v(x_2)), \ldots, (x_n, v(x_n))\}$,
where the keys $X = \{x_1, x_2, \ldots, x_n\}$ are drawn from a large universe $U$ and each value $v(x)$ is drawn
from a fixed set of possible values $V = \{v_1, v_2, \ldots, v_b\}$. Suppose further that the distribution of
values over keys is given by $p = (p_1, p_2, \ldots, p_b)$. Thus if $X_i = \{x \in X \mid v(x) = v_i\}$ then $|X_i| = p_i n$.

We consider the problem of constructing a space-efficient data structure supporting queries on
$M$. For any key $x \in U$ the data structure should return the associated value $v(x)$ if $x \in X$,
otherwise (i.e. if $x \in U \setminus X$) it should return $\bot \notin V$.

Using an information theoretic argument, we derive lower bounds on the space required to solve
this problem when errors are allowed. These lower bounds are in terms of the error rate and the
entropy of the distribution of values over keys $H(p)$.

We introduce the Bloom map, a data structure generalising the Bloom filter [1] to the approximate
map problem. The space requirements of this data structure are within a $\log e$ factor of the
lower bound. To be precise, for an error rate of $\epsilon$ the Bloom map uses $\log e\,(\log 1/\epsilon + H(p))$ bits per
key.

To our knowledge, this paper is the first to make use of the distribution of values over keys to
analyse the approximate map problem. Moreover, the Bloom map is the first data structure to 
take advantage of this distribution to save space by using variable length codes for distinct values. 
In the many practical settings where distributions with low entropy are encountered we expect the 
Bloom map to be of significant interest. 

The main prior work on the approximate map problem is the Bloomier filter introduced by
Chazelle et al. [3]. To store key/value pairs with values drawn from a range of size $b$ with false
positive probability $\epsilon$, the Bloomier filter requires $\alpha(\log 1/\epsilon + \log b)$ bits per key, effectively using a
fixed width encoding for any value in the range. It always returns the correct value for any $x \in X$.

The Bloomier filter uses a perfect hash function introduced earlier by Czech et al. [5], whose
analysis implies that the optimal $\alpha$ is approximately 1.23. A simple calculation shows that in many
cases the Bloom map will use less space. In fact it is also straightforward to extend the Bloom map
to make use of the same family of perfect hash functions, thereby reducing its space requirements
to $1.23(\log 1/\epsilon + H(p))$.

In the next section we give a complete statement of the problem and prove lower bounds on
the space requirements of any data structure supporting approximate queries of a static map with
bounded errors (our most general result is Theorem 2). In Section 3 we introduce the Simple Bloom
map, a data structure supporting approximate queries that has near-optimal space requirements.
In Section 4 we present more computationally efficient versions of the Bloom map.

2 Problem statement and lower bounds 

Consider a map of $n$ key/value pairs $M = \{(x_1, v(x_1)), (x_2, v(x_2)), \ldots, (x_n, v(x_n))\}$, where the keys
$X = \{x_1, x_2, \ldots, x_n\}$ are drawn from a large universe $U$ of size $u$ and each value $v(x)$ is drawn
from a fixed set of possible values $V = \{v_1, v_2, \ldots, v_b\}$. Suppose further that the distribution of
values over keys is given by $p = (p_1, p_2, \ldots, p_b)$, where $\sum_{i=1}^b p_i = 1$ and $\min_{i \in [b]} p_i > 0$. Thus if
$X_i = \{x \in X \mid v(x) = v_i\}$ then $|X_i| = p_i n$. We call such a collection $M$ of key/value pairs a p-map.

We consider the problem of constructing a space-efficient data structure supporting queries on
a static p-map $M$. For any key $x \in U$ the data structure should return the associated value $v(x)$ if
$x \in X$, otherwise it should return $\bot \notin V$. We will be interested in the case when $n$ is large, $u \gg n$
and $b$, $p = (p_1, p_2, \ldots, p_b)$ are constant.

Given $u$, $n$, $b$ and $p = (p_1, p_2, \ldots, p_b)$ the total number of distinct p-maps is

$$\binom{u}{n}\binom{n}{p_1 n, p_2 n, \ldots, p_b n}.$$

By Stirling's formula the multinomial coefficient is $2^{nH(p) + O(\log n)}$, where $H(p) = -\sum_{i=1}^b p_i \log p_i$ is
the entropy of $p$. (Logarithms here and elsewhere are base two.) Hence to distinguish between all
p-maps without errors we require $m \ge n(\log u - \log n + H(p) + o(1))$ bits. For $n$ large and $u \gg n$
this is prohibitive: in particular we require more than a constant number of bits per key. Hence we
are obliged to consider lossy data structures.
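
As a rough numerical illustration of this counting argument (a sketch, not part of the original analysis), the snippet below computes the exact value of $\log$ of the number of p-maps per key and compares it with $\log u - \log n + H(p)$. The function names and the parameter values ($u = 2^{40}$, $n = 10^4$) are assumptions chosen only for illustration.

```python
import math


def entropy(p):
    """Entropy H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def log2_binom(u, n):
    """log2 of C(u, n), summed term by term to stay accurate for huge u."""
    return sum(math.log2(u - k) - math.log2(k + 1) for k in range(n))


def exact_bits_per_key(u, n, p):
    """(1/n) * log2 of the number of distinct p-maps."""
    counts = [round(pi * n) for pi in p]
    assert sum(counts) == n, "each p_i * n must be an integer"
    log2_multinomial = (
        math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    ) / math.log(2)
    return (log2_binom(u, n) + log2_multinomial) / n


def stated_bits_per_key(u, n, p):
    """The leading terms log u - log n + H(p) of the bound in the text."""
    return math.log2(u) - math.log2(n) + entropy(p)


u, n, p = 2 ** 40, 10 ** 4, [0.5, 0.25, 0.25]
exact = exact_bits_per_key(u, n, p)
stated = stated_bits_per_key(u, n, p)
```

In this example the exact count sits roughly $\log e$ bits per key above the stated expression, which is consistent with the expression being used as a lower bound; either way the cost grows with $\log(u/n)$ and so cannot be a constant number of bits per key.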

There are three distinct types of error that we will consider: (False positives) $x \in U \setminus X$ is
incorrectly assigned a value $v_i \in V$; (False negatives) $x \in X_i$ is incorrectly assigned the value $\bot$;
(Misassignments) $x \in X_i$ is incorrectly assigned a value $v \in V \setminus \{v_i\}$.

Let $s$ be a binary string supporting queries by keys $x \in U$, i.e. $s : U \to V \cup \{\bot\}$. Suppose that
we use $s$ to encode a p-map $M$ with key set $X$. We wish to bound the proportion of keys on which
$s$ returns an incorrect value. For $i \in [b]$ we define

$$f^+(s) = \frac{|\{x \in U \setminus X \mid s(x) \ne \bot\}|}{|U \setminus X|}, \qquad
f_i^*(s) = \frac{|\{x \in X_i \mid s(x) \in V \setminus \{v_i\}\}|}{|X_i|}, \qquad
f_i^-(s) = \frac{|\{x \in X_i \mid s(x) = \bot\}|}{|X_i|}.$$

Thus $f^+(s)$ is the proportion of false positives returned on $U \setminus X$, $f_i^*(s)$ is the proportion of
misassigned values on $X_i$ and $f_i^-(s)$ is the proportion of false negatives returned on $X_i$.

Given constants $\epsilon^+ > 0$ and $\epsilon^*, \epsilon^- \ge 0$ we will say that $s$ $(\epsilon^+, \epsilon^*, \epsilon^-)$-encodes $M$ if it satisfies:
$f^+(s) \le \epsilon^+$ and, for all $i \in [b]$, $f_i^*(s) \le \epsilon^*$ and $f_i^-(s) \le \epsilon^-$. (We will assume throughout that
$\max\{\epsilon^+, \epsilon^*, \epsilon^-\} < 1/8$.)

If the only errors we allow are false positives then we have an $(\epsilon^+, 0, 0)$-encoding data structure.
(An example of such a data structure is the Bloomier filter [3].) Theorem 1 gives lower bounds on the
space requirements of such a data structure. (The proof follows a counting argument generalising
the argument applied to the approximate set membership problem by Carter et al. [2].)

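Theorem 1 The average number of bits required per key in any data structure that $(\epsilon^+, 0, 0)$-
encodes all p-maps is at least

$$\log 1/\epsilon^+ + H(p) + o(1).$$

Proof. Suppose that the $m$-bit string $s$ $(\epsilon^+, 0, 0)$-encodes some particular p-map $M$ with key set $X$.
For $i \in [b]$ let $A_i^{(s)} = \{x \in U \mid s(x) = v_i\}$, $a_i^{(s)} = |A_i^{(s)}|$ and define $q_i^{(s)}$ by $a_i^{(s)} = p_i n + \epsilon^+(u - n)q_i^{(s)}$.
Since $X_i = \{x \in X \mid v(x) = v_i\}$ has size $p_i n$ and $s$ always answers correctly on $X_i$ we have $q_i^{(s)} \ge 0$.
Since $f^+(s) \le \epsilon^+$, this implies

$$\sum_{i=1}^b q_i^{(s)} \le 1.$$

If $N$ is any p-map with key set $Y$ that is also $(\epsilon^+, 0, 0)$-encoded by $s$ then, since $s$ correctly
answers all queries on keys in $Y$, we have $Y_i = \{y \in Y \mid v(y) = v_i\} \subseteq A_i^{(s)}$ for all $i \in [b]$. Hence,
since $|Y_i| = p_i n$, $s$ can $(\epsilon^+, 0, 0)$-encode at most the following number of distinct p-maps

$$\prod_{i=1}^b \binom{a_i^{(s)}}{p_i n} = \prod_{i=1}^b \binom{p_i n + \epsilon^+(u - n)q_i^{(s)}}{p_i n}.$$

Choosing $q_1, q_2, \ldots, q_b \ge 0$ to maximise this expression, subject to $\sum_{i=1}^b q_i \le 1$, and noting that
there are $2^m$ strings of length $m$, each of which must encode every p-map assigned to it, we have

$$2^m \prod_{i=1}^b \binom{p_i n + \epsilon^+(u - n)q_i}{p_i n} \ge \binom{u}{n}\binom{n}{p_1 n, p_2 n, \ldots, p_b n}.$$

Taking logarithms, using $(a - b)^b/b! \le \binom{a}{b} \le a^b/b!$, dividing by $n$, recalling that $\sum_{i=1}^b p_i = 1$ and
rearranging we obtain

$$\frac{m}{n} \ge \log 1/\epsilon^+ + \sum_{i=1}^b p_i \log 1/q_i - \sum_{i=1}^b p_i \log\left(1 + \frac{n(p_i - \epsilon^+ q_i)}{\epsilon^+ q_i u}\right) + \log\left(1 - \frac{n}{u}\right).$$

Our assumption that $u \gg n$ (which is equivalent to $n/u = o(1)$) together with the fact that
$\log(1 + a) = O(a)$ for $a$ small implies that the last two terms are $o(1)$. Hence the average number
of bits required per key satisfies

$$\frac{m}{n} \ge \log 1/\epsilon^+ + \sum_{i=1}^b p_i \log 1/q_i + o(1).$$

Gibbs' inequality implies that the sum is minimised when $q_i = p_i$ for all $i \in [b]$; the result follows. □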
This calculation can be extended to the case when errors are also allowed on keys in the set X. 

Theorem 2 The average number of bits required per key in any data structure that $(\epsilon^+, \epsilon^*, \epsilon^-)$-
encodes all p-maps is at least

$$(1 - \epsilon^-)\log 1/\epsilon^+ + (1 - \epsilon^- - \epsilon^*)H(p) - H(\epsilon^-, \epsilon^*, 1 - \epsilon^- - \epsilon^*) + o(1).$$

Proof: The basic idea behind the proof of this result is the same as that of Theorem 1, however
the details are somewhat more involved. (See Appendix.) □
The Bloom map, which we introduce in Section 3, is $(\epsilon, \epsilon, 0)$-encoding. To enable us to evaluate
how far its space requirements are from optimal we give the following simple corollary.

Corollary 3 The average number of bits required per key in any data structure that $(\epsilon, \epsilon, 0)$-encodes
all p-maps is at least

$$(1 - \epsilon)(\log 1/\epsilon + H(p) - (\epsilon + \epsilon^2)) + o(1).$$
Proof: Substitute $\epsilon^+ = \epsilon^* = \epsilon$ and $\epsilon^- = 0$ into Theorem 2 and use $\log(1 - \epsilon) \ge -(\epsilon + \epsilon^2)$. □
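
The lower bounds of Theorems 1 and 2 are easy to evaluate numerically. The following is a minimal sketch that ignores the $o(1)$ terms; the function and argument names are our own assumptions, introduced only for illustration.

```python
import math


def entropy(p):
    """Entropy in bits of a probability vector (zero entries contribute 0)."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def theorem1_bound(eps_pos, p):
    """log 1/eps+ + H(p): bits per key when only false positives are allowed."""
    return math.log2(1 / eps_pos) + entropy(p)


def theorem2_bound(eps_pos, eps_mis, eps_neg, p):
    """Bits per key for an (eps+, eps*, eps-)-encoding structure (Theorem 2)."""
    correct = 1 - eps_neg - eps_mis
    return (
        (1 - eps_neg) * math.log2(1 / eps_pos)
        + correct * entropy(p)
        - entropy([eps_neg, eps_mis, correct])
    )


p = [0.5, 0.25, 0.25]
only_false_positives = theorem1_bound(0.01, p)
with_misassignments = theorem2_bound(0.01, 0.01, 0.0, p)
```

Setting $\epsilon^* = \epsilon^- = 0$ in `theorem2_bound` recovers `theorem1_bound`, and allowing misassignments lowers the bound slightly, as one would expect.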

3 The Simple Bloom map 

Let $M$ be a p-map with key set $X$. Thus, for $i \in [b]$, $X_i = \{x \in X \mid v(x) = v_i\}$ has size $p_i n$. Our
first succinct data structure supporting queries for $M$ is the Simple Bloom map. This is constructed
by simply storing the values directly in a Bloom filter.

Let $B$ be an array of size $m$ that is initially empty. For each $i \in [b]$ we choose $k_i \ge 1$ independent
random hash functions $h_{i,j} : U \to [m]$ (we will explain how to set $k_1, k_2, \ldots, k_b$ optimally below).
To store the key/value pair $(x, v_i)$ we compute $h_{i,j}(x)$ for each $j \in [k_i]$ and set the bits $B[h_{i,j}(x)]$
to one. To query $B$ with a key $x \in U$ we compute $h_{i,j}(x)$ for each $i \in [b]$, $j \in [k_i]$ and set

$$\mathrm{qval}(x) = \left\{ i \in [b] \;\middle|\; \wedge_{j=1}^{k_i} B[h_{i,j}(x)] = 1 \right\}.$$

If $\mathrm{qval}(x) = \emptyset$ we return $\bot$, otherwise we return $v_c$, where $c = \max \mathrm{qval}(x)$. Note that if $(x, v_i) \in M$
then $i \in \mathrm{qval}(x)$ and so $\bot$ is never returned when querying $x$, i.e. there are no false negatives.
However both false positives and misassignments can occur.
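
The store and query procedures just described can be sketched as follows. This is an illustrative sketch only: the class name, the use of `hashlib.blake2b` as a stand-in for independent random hash functions, the zero-based value indices and the use of `None` for $\bot$ are all our assumptions.

```python
import hashlib


class SimpleBloomMap:
    def __init__(self, m, ks):
        self.m = m                # number of bits in B
        self.ks = ks              # ks[i] = k_i, hashes used for value index i
        self.bits = bytearray(m)  # one byte per bit, for readability

    def _h(self, i, j, key):
        """Stand-in for h_{i,j}: a salted hash of the key reduced modulo m."""
        digest = hashlib.blake2b(f"{i}:{j}:{key}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.m

    def store(self, key, i):
        """Store the pair (key, v_i) by setting k_i bits."""
        for j in range(self.ks[i]):
            self.bits[self._h(i, j, key)] = 1

    def query(self, key):
        """Return max qval(key), or None (standing for the symbol ⊥) if qval is empty."""
        qval = [
            i
            for i in range(len(self.ks))
            if all(self.bits[self._h(i, j, key)] for j in range(self.ks[i]))
        ]
        return max(qval) if qval else None


bm = SimpleBloomMap(m=4096, ks=[8, 9, 9])
pairs = [(f"key{n}", n % 3) for n in range(100)]
for key, value_index in pairs:
    bm.store(key, value_index)
answers = [bm.query(key) for key, _ in pairs]
```

Because every bit set by `store` is checked again by `query`, a stored key can never return `None`, and the returned index is never smaller than the stored one; any error on a stored key is a misassignment to a larger index.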

Let $t = n\sum_{i=1}^b p_i k_i$ be the total number of hashes performed during the creation of $B$. Let $\rho$
be the proportion of bits that remain zero in $B$. If $f^+(B)$ is the false positive probability of $B$,
i.e. the probability that $B$ returns $v \ne \bot$ for a fixed $x \in U \setminus X$, then

$$f^+(B) = \Pr\{\mathrm{qval}(x) \ne \emptyset\} \le \sum_{i=1}^b \Pr\{i \in \mathrm{qval}(x)\} = \sum_{i=1}^b (1 - \rho)^{k_i}.$$

If $f_i^*(B)$ is the misassignment probability for $B$ over keys in $X_i$, i.e. the probability that $B$ returns
$v \in V \setminus \{v_i\}$ for a fixed $x \in X_i$, then

$$f_i^*(B) = \Pr\{\max \mathrm{qval}(x) > i\} \le \sum_{j=i+1}^b \Pr\{j \in \mathrm{qval}(x)\} = \sum_{j=i+1}^b (1 - \rho)^{k_j} \le \sum_{j=1}^b (1 - \rho)^{k_j}.$$

Hence in order to minimise $f^+(B)$ and $f_i^*(B)$ we consider the constrained optimisation problem:
minimise $\sum_{i=1}^b (1 - \rho)^{k_i}$ subject to $\sum_{i=1}^b p_i k_i = t/n$, where $\rho$ denotes the proportion of zero bits
in $B$. A standard application of Lagrange multipliers yields the solution

$$k_i = \frac{t}{n} + \frac{H(p) + \log p_i}{\log(1 - \rho)}.$$

For this choice of the $k_i$ we have

$$\sum_{i=1}^b (1 - \rho)^{k_i} = (1 - \rho)^{t/n}\, 2^{H(p)} \sum_{i=1}^b p_i = 2^{H(p)} (1 - \rho)^{t/n}.$$

By a simple martingale argument, identical to that given by Mitzenmacher [6] for the Bloom filter,
$\rho$ is extremely close to its expected value if $t = O(m)$ (see Appendix). Assuming $\rho \ge E[\rho]$ we have

$$2^{H(p)} (1 - \rho)^{t/n} \le 2^{H(p)} \left(1 - \left(1 - \tfrac{1}{m}\right)^t\right)^{t/n}.$$

This last expression (without the factor $2^{H(p)}$) is familiar from the standard Bloom filter error
analysis: it is minimised at $t = m \ln 2$, at which point the bound equals $2^{H(p) - t/n}$. (Note that as for the standard
Bloom filter the expected proportion of bits set in $B$ is 1/2.)

Thus to guarantee $f^+(B) \le \epsilon$ and $f_i^*(B) \le \epsilon$ for all $i \in [b]$ it is sufficient to take

$$m = n \log e\,(\log 1/\epsilon + H(p)), \qquad k_i = \log 1/\epsilon + \log 1/p_i \ \text{ for } i \in [b].$$

(As with the standard Bloom filter, the $k_i$ must be integers; for simplicity we will ignore this.)
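
To make this parameter choice concrete, the sketch below evaluates $m$ and the (real-valued) $k_i$ for an example distribution and checks that the union bound $\sum_i (1-\rho)^{k_i}$ with $\rho = 1/2$ comes out at $\epsilon$. The function name and the example values are assumptions made for illustration; as in the text, rounding the $k_i$ to integers is ignored.

```python
import math


def simple_bloom_map_parameters(n, eps, p):
    """Return (m, [k_1, ..., k_b]) as chosen in the text, without rounding."""
    h = -sum(pi * math.log2(pi) for pi in p)
    m = n * math.log2(math.e) * (math.log2(1 / eps) + h)
    ks = [math.log2(1 / eps) + math.log2(1 / pi) for pi in p]
    return m, ks


n, eps, p = 1000, 0.01, [0.5, 0.25, 0.25]
m, ks = simple_bloom_map_parameters(n, eps, p)

# Total number of hashes t = n * sum_i p_i k_i; the text requires t = m ln 2.
t = n * sum(pi * ki for pi, ki in zip(p, ks))

# Union bound on the error with half of the bits zero: sum_i 2^{-k_i}.
union_bound = sum(2 ** (-ki) for ki in ks)
```

Rarer values receive more hash functions ($k_i$ grows with $\log 1/p_i$), which is how the structure spends fewer bits on frequent values.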

Since Corollary 3 gives a lower bound for the space required by an $(\epsilon, \epsilon, 0)$-encoding data
structure we would like to claim that $B$ is $(\epsilon, \epsilon, 0)$-encoding. This is not quite true: the expected
proportion of false positives and misassignments is at most $\epsilon$ but this does not guarantee that $B$ is
$(\epsilon, \epsilon, 0)$-encoding. However $B$ is still essentially $(\epsilon, \epsilon, 0)$-encoding since, with high probability, the
proportion of false positives or misassignments is at most $\epsilon + O(1/\sqrt{n})$. (See Appendix for details.)



Theorem 4 The Simple Bloom map $(\epsilon, \epsilon, 0)$-encodes all p-maps and uses $\log e\,(\log 1/\epsilon + H(p))$ bits
per key.

Note that by Corollary 3 the space requirements of the Simple Bloom map are essentially a factor
$(1 - \epsilon)^{-1} \log e$ from optimal; for $\epsilon < 0.01$ this is less than 1.46.

We remark that an $(\epsilon, \epsilon, \epsilon)$-encoding data structure can be created from the Simple Bloom
map by simply discarding $\epsilon p_i n$ keys from $X_i$ for each $i \in [b]$. The amount of memory saved is
$\epsilon n \log e\,(\log 1/\epsilon + H(p))$ (cf. Theorem 2).

Although the Simple Bloom map is succinct it suffers from two obvious drawbacks if $b$ is not
small: the number of hashes/bit probes performed during a query and the number of independent
hash functions required is $O(b \log(b/\epsilon))$. In Section 4 we explain how to overcome these problems
by "reusing" hash functions and using an optimal binary search tree.

Store(x, v_i, T, B)
    for d = 0 to l_{v_i}
        w <- P_{v_i} ∩ T_d
        for j = 1 to k_w
            B[h_{w,j}(x)] <- 1

Query(x, T, B)
    v <- ⊥
    Findval(x, v, T, B)
    return v

Findval(x, v, T, B)
    w <- r(T)
    for j = 1 to k_w
        if B[h_{w,j}(x)] = 0 then return false
    if w is a leaf then
        v <- val(w); return true
    if Findval(x, v, T_R, B) then return true
    return Findval(x, v, T_L, B)

Figure 1: Storing and querying keys in a Bloom map

4 Efficient Bloom maps 

Let $M$ be a p-map that we wish to store. Sort the list of probabilities so that $p_1 \ge p_2 \ge
\cdots \ge p_b$. Construct an optimal alphabetic binary tree $T(p)$ for $p$ with leaves labelled $v_1, v_2, \ldots, v_b$
(by for example the Garsia-Wachs algorithm, see Knuth [4] page 446). The label of a leaf $w$ is
denoted by $\mathrm{val}(w)$. Note that $T(p)$ is a full binary tree, i.e. every node is either a leaf or has exactly
two children.

For any binary tree $T$ let $r(T)$ denote its root and $T_L$, $T_R$ denote its left and right subtrees
respectively. For any node $w$ let $P_w$ denote the set of nodes on the path in $T$ from the root to $w$
and let $l_w = |P_w| - 1$ be the depth of $w$. For $d \ge 0$ let $T_d$ be the set of nodes in $T$ at depth $d$.

We number the nodes in $T(p)$ from left to right at each level, starting at the root and going
down. We call these numbers offsets. (So the root has offset 0, its left child has offset 1 and its
right child has offset 2 etc.) Note that all nodes have distinct offsets. The offset of a node $w$ is
denoted $\mathrm{off}(w)$. To each node $w \in T(p)$ we also associate an integer $k_w$. We will specify choices for
the $k_w$ later; we first impose two simple conditions: $k_w \ge 1$ for all nodes and $k_w \ge \log 1/\epsilon$ for all
leaves. Set

$$m = n \log e \sum_{i=1}^b p_i \sum_{w \in P_{v_i}} k_w, \qquad k = \max_{i \in [b]} \sum_{w \in P_{v_i}} k_w. \qquad (1)$$

We now impose a third condition on the $k_w$: they are chosen so that $m \le 2n \log e \log(b/\epsilon)$.

Let $h_1, h_2, \ldots, h_k$ be independent random hash functions, $h_j : U \to [m]$. (We will refer to these
as the base hash functions.) For a node $w$ let $s_w = \sum_{w' \in P_w \setminus \{w\}} k_{w'}$. We associate $k_w$ hash functions
with $w$: $h_{w,1}, h_{w,2}, \ldots, h_{w,k_w}$, where $h_{w,j} : U \to [m]$ is defined by $h_{w,j}(x) = h_{s_w + j}(x) + \mathrm{off}(w)
\bmod m$.

The Bloom map $B$ is an array of size $m$ that is initially empty (all bits are zero). To store
a key/value pair $(x, v_i)$ we use the algorithm Store$(x, v_i, T(p), B)$ (see Figure 1). This does the
following: for each node $w$ in the path $P_{v_i}$, starting from the root, it evaluates the associated $k_w$
hashes at $x$ and sets the corresponding bits in $B$. Note that (ignoring offsets) the hash functions
used while storing $(x, v_i)$ are $h_1, \ldots, h_{t_i}$, where $t_i = \sum_{w \in P_{v_i}} k_w$. Hence the bits which are set
in $B$ by Store$(x, v_i, T(p), B)$ are chosen independently and uniformly at random. Moreover, since
each key is stored with at most one value, the entire process of storing the p-map in $B$ is equivalent
to setting $t = n\sum_{i=1}^b p_i t_i$ independently chosen random bits in $B$.

To query $B$ with a key $x \in U$ we use the algorithm Query$(x, T(p), B)$. This calls
Findval$(x, v, T(p), B)$ with $v$ initialised to $\bot$ and returns the value of $v$ when Findval$(x, v, T(p), B)$
terminates (see Figure 1). Starting with $T(p)$, Findval evaluates the hash functions associated with
the root of the current tree, returning false if it finds a zero bit in $B$, otherwise it continues down
the tree, first looking at the right subtree and then, if this fails, looking at the left subtree. If it
reaches a leaf at which the corresponding bits in $B$ are all set then $v$ is assigned the value associated
with this leaf and it returns true, otherwise the value of $v$ will remain equal to $\bot$.
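
The Store, Query and Findval procedures can be sketched in code as follows. This is a hedged illustration rather than a reference implementation: the tree is supplied by hand as nested pairs instead of being built by the Garsia-Wachs algorithm, `hashlib.blake2b` stands in for the independent base hash functions, values are zero-based indices, `None` plays the role of $\bot$, and Findval returns the value instead of updating $v$ in place. All class and function names are our own.

```python
import hashlib
import math
from collections import deque


class Node:
    def __init__(self, left=None, right=None, value=None):
        self.left, self.right, self.value = left, right, value
        self.k = 0      # k_w: number of hash functions at this node
        self.start = 0  # s_w: hash functions used by proper ancestors
        self.off = 0    # off(w): breadth-first offset


def build(shape):
    """shape is an int (a leaf holding that value index) or a (left, right) pair."""
    if isinstance(shape, int):
        return Node(value=shape)
    return Node(left=build(shape[0]), right=build(shape[1]))


class BloomMap:
    def __init__(self, shape, m, k_internal, k_leaf):
        self.m = m
        self.bits = bytearray(m)  # one byte per bit, for readability
        self.root = build(shape)
        self.path = {}            # value index -> nodes on the root-to-leaf path
        queue, offset = deque([(self.root, 0, [])]), 0
        while queue:              # level order, left to right, gives the offsets
            node, start, ancestors = queue.popleft()
            node.off, node.start = offset, start
            offset += 1
            here = ancestors + [node]
            if node.value is not None:
                node.k = k_leaf
                self.path[node.value] = here
            else:
                node.k = k_internal
                queue.append((node.left, start + node.k, here))
                queue.append((node.right, start + node.k, here))

    def _base_hash(self, index, key):
        """Stand-in for the base hash function h_index."""
        digest = hashlib.blake2b(f"{index}:{key}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.m

    def _positions(self, node, key):
        """Yield h_{w,j}(x) = h_{s_w + j}(x) + off(w) mod m for j = 1..k_w."""
        for j in range(1, node.k + 1):
            yield (self._base_hash(node.start + j, key) + node.off) % self.m

    def store(self, key, value):
        for node in self.path[value]:
            for pos in self._positions(node, key):
                self.bits[pos] = 1

    def _findval(self, key, node):
        if not all(self.bits[pos] for pos in self._positions(node, key)):
            return None           # a zero bit: abandon this subtree
        if node.value is not None:
            return node.value
        found = self._findval(key, node.right)  # right subtree first
        return found if found is not None else self._findval(key, node.left)

    def query(self, key):
        return self._findval(key, self.root)


# Example with p = (1/2, 1/4, 1/4): leaves v_1, v_2, v_3 at depths 1, 2, 2.
eps, p, depths, n = 0.01, [0.5, 0.25, 0.25], [1, 2, 2], 200
k_internal, k_leaf = 2, math.ceil(math.log2(1 / eps)) + 2
hashes_per_key = sum(pi * (k_internal * d + k_leaf) for pi, d in zip(p, depths))
m = math.ceil(n * math.log2(math.e) * hashes_per_key)
bm = BloomMap((0, (1, 2)), m, k_internal, k_leaf)
pairs = [(f"key{i}", 0 if i % 2 == 0 else (1 if i % 4 == 1 else 2)) for i in range(n)]
for key, value in pairs:
    bm.store(key, value)
```

Sharing the base hash functions along a path means a query evaluates at most $k$ distinct base hashes per key, with the per-node offsets keeping the probed positions at different nodes distinct.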

By our choice of $m = t \log e$ the expected proportion of bits that remain zero in $B$ (once we
have stored the p-map $M$) is 1/2 and with high probability the actual proportion, which we denote
by $\rho$, is very close to this. For simplicity we will assume that $\rho \ge 1/2$.

We now consider the probability of errors. To simplify our analysis we assume that any leaf $v_i$
is at depth $\log 1/p_i$ (since $T(p)$ is an optimal alphabetic binary tree this is almost true). For $x \in U$
and $i \in [b]$ define

$$\mathcal{H}_i(x) = \{h_{w,l}(x) \mid w \in P_{v_i},\ l \in [k_w]\}, \qquad \mathrm{qval}(x) = \left\{ i \in [b] \;\middle|\; \wedge_{h \in \mathcal{H}_i(x)} B[h] = 1 \right\}.$$

Thus $i \in \mathrm{qval}(x)$ iff all of the bits in $B$ indexed by the hash functions on the path $P_{v_i}$ evaluated at $x$
are set. If $\mathrm{qval}(x) = \emptyset$ then Query returns $\bot$, otherwise, since Findval always explores right subtrees
first, it returns $v_c$, where $c = \max \mathrm{qval}(x)$. If $x \in X_i$ then $i \in \mathrm{qval}(x)$ and so no false negatives can
occur. False positives and misassignments are possible; we consider the case of false positives first.

If $x \in U \setminus X$ then for fixed $i \in [b]$ the bits in $\mathcal{H}_i(x)$ are simply independent random choices
from $[m]$. This is because if $t_i = \sum_{w \in P_{v_i}} k_w$ then the hash functions we evaluate are simply offsets,
modulo $m$, of the first $t_i$ of our base hash functions. By our assumptions that: $k_w \ge 1$ for all nodes;
$k_{v_i} \ge \log 1/\epsilon$ and $v_i$ is at depth $\log 1/p_i$, we have $t_i \ge -\log \epsilon p_i$. Since $\rho \ge 1/2$ the false positive
probability satisfies

$$f^+(B) = \Pr\{\mathrm{qval}(x) \ne \emptyset\} \le \sum_{i=1}^b \Pr\{i \in \mathrm{qval}(x)\} \le \sum_{i=1}^b (1 - \rho)^{t_i} \le \sum_{i=1}^b 2^{-t_i} \le \epsilon.$$

Calculating the probability of a misassignment when $B$ is queried with $x \in X_i$ is more involved.
Note that if an incorrect value $v_j \ne v_i$ is returned for $x \in X_i$ then $j > i$. For $i < j$ and
$x \in X_i$ let $P_{i,j} = P_{v_j} \setminus P_{v_i}$ be the part of the path $P_{v_j}$ that is disjoint from the path $P_{v_i}$ and let
$\mathcal{H}_{i,j}(x) = \{h_{w,l}(x) \mid w \in P_{i,j},\ l \in [k_w]\}$. The misassignment probability satisfies

$$f_i^*(B) = \Pr\{\max \mathrm{qval}(x) > i\} \le \sum_{j=i+1}^b \Pr\{j \in \mathrm{qval}(x)\} = \sum_{j=i+1}^b \Pr\left\{\wedge_{h \in \mathcal{H}_{i,j}(x)} B[h] = 1\right\}. \qquad (2)$$

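To bound this probability we consider the following: suppose that rather than storing all of the
key/value pairs from $M$ in $B$ we had instead stored all of them except $(x, v_i)$. Let $B'$ denote the
resulting $m$-bit array and let $t_{i,j} = |\mathcal{H}_{i,j}(x)|$. Since $(x, v_i)$ has not been stored in $B'$ we have (by the
same argument as used for $f^+(B)$) that

$$\Pr\left\{\wedge_{h \in \mathcal{H}_{i,j}(x)} B'[h] = 1\right\} \le 2^{-t_{i,j}}. \qquad (3)$$

If all of the bits in $B$ indexed by elements in $\mathcal{H}_{i,j}(x)$ are set then either they are all set in $B'$ or
there must be at least one bit in $\mathcal{H}_{i,j}(x)$ that is only set once $(x, v_i)$ is stored. The latter case can only
occur if $\mathcal{H}_i(x) \cap \mathcal{H}_{i,j}(x) \ne \emptyset$. Hence

$$\Pr\left\{\wedge_{h \in \mathcal{H}_{i,j}(x)} B[h] = 1\right\} \le \Pr\left\{\wedge_{h \in \mathcal{H}_{i,j}(x)} B'[h] = 1\right\} + \Pr\{\mathcal{H}_i(x) \cap \mathcal{H}_{i,j}(x) \ne \emptyset\}. \qquad (4)$$

If $h_1 \in \mathcal{H}_i(x)$ and $h_2 \in \mathcal{H}_{i,j}(x)$ then $\Pr\{h_1 = h_2\}$ is either $1/m$ or 0, since $h_1$ and $h_2$ either use
different base hash functions (and so are independent and random in $[m]$) or they use the same
base hash function with different offsets and hence are distinct.

Recall that $k = \max_{j \in [b]} \sum_{w \in P_{v_j}} k_w$. If $c \in [b]$ satisfies $k = \sum_{w \in P_{v_c}} k_w$ then $m \ge n p_c k \log e$.
Moreover the $k_w$ were chosen so that $n \le m \le 2n \log e \log(b/\epsilon)$. Hence

$$\Pr\{\mathcal{H}_i(x) \cap \mathcal{H}_{i,j}(x) \ne \emptyset\} \le \frac{|\mathcal{H}_i(x)|\,|\mathcal{H}_{i,j}(x)|}{m} \le \frac{k^2}{m} \le \left(\frac{m}{n p_c \log e}\right)^2 \frac{1}{m} \le \frac{2\log(b/\epsilon)}{p_c^2 \log e} \cdot \frac{1}{n} = O\left(\frac{1}{n}\right), \qquad (5)$$

where the final equality uses our assumption that $b$ and $\epsilon$ are constant.
Combining (2), (3), (4) and (5) we obtain

$$f_i^*(B) \le \sum_{j=i+1}^b \left(2^{-t_{i,j}} + O\left(\frac{1}{n}\right)\right) = \sum_{j=i+1}^b 2^{-t_{i,j}} + O\left(\frac{1}{n}\right), \qquad (6)$$

where $t_{i,j} = \sum_{w \in P_{i,j}} k_w$. Thus to ensure $f_i^*(B) \le \epsilon$ we choose the $k_w$ so that $\sum_{j=i+1}^b 2^{-t_{i,j}} \le \epsilon$.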

There are various ways in which this can be done and exactly how we choose the $k_w$ will affect
not only $f_i^*(B)$ but also the memory required to store the Bloom map and the amount of work we
expect to do when querying it. Since different space/time trade-offs may be of interest in different
applications we define two special types of Bloom map: Standard and Fast.

• (Standard) $k_w = 1$ for all internal nodes (i.e. all non-leaf nodes), $k_w = \log 1/\epsilon + \log(H_b - 1) + 1$
for all leaves (where $H_b = \sum_{i=1}^b 1/i$ is the $b$th Harmonic number).

• (Fast) $k_w = 2$ for all internal nodes, $k_w = \log 1/\epsilon + 2$ for all leaves.

Theorem 5 The Standard and Fast Bloom maps are both $(\epsilon, \epsilon, 0)$-encoding for p-maps. The
average number of bits required per key is:

• (Standard): $\log e\,(\log 1/\epsilon + H(p) + \log(H_b - 1) + 1)$.

• (Fast): $\log e\,(\log 1/\epsilon + 2H(p) + 2)$.

If $x \in U \setminus X$ then the expected number of bit probes performed during Query$(x, T(p), B)$ is at
most: (Standard) $H(p) + 2$; (Fast) 3.

If $x \in X_i$ then the expected number of bit probes performed during Query$(x, T(p), B)$ is at most:
(Standard) $O((\log b)^2) + \log 1/p_i + \log 1/\epsilon$; (Fast) $3\log(b - i + 1) + 2\log 1/p_i + \log 1/\epsilon + 2$.

The Standard Bloom map uses little more than a factor $(1 - \epsilon)^{-1} \log e$ extra bits per key than the
lower bound of Corollary 3. (In addition to the factor of $(1 - \epsilon)^{-1} \log e$ it uses at most an extra
$1 + \log\log b$ bits per key, since $H_b < \log b$.) The Fast Bloom map uses slightly more space but has
the advantage of using significantly fewer bit probes when querying keys: in particular we expect
to perform at most 3 bit probes on $x \in U \setminus X$. In any case the Fast Bloom map uses less than 2.9
times as much memory per key as the lower bound and if $H(p)$ is small compared to $\log 1/\epsilon$ this
factor will be much closer to 1.46.
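
The space figures in Theorem 5 can be compared with the lower bound of Corollary 3 for a concrete distribution. The sketch below drops the $o(1)$ term and requires $b \ge 2$ so that $\log(H_b - 1)$ is defined; the function names and the example distribution are assumptions made for illustration.

```python
import math

LOG2_E = math.log2(math.e)


def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def harmonic(b):
    """The b-th harmonic number H_b."""
    return sum(1 / i for i in range(1, b + 1))


def standard_bits_per_key(eps, p):
    """log e (log 1/eps + H(p) + log(H_b - 1) + 1); needs len(p) >= 2."""
    return LOG2_E * (
        math.log2(1 / eps) + entropy(p) + math.log2(harmonic(len(p)) - 1) + 1
    )


def fast_bits_per_key(eps, p):
    """log e (log 1/eps + 2 H(p) + 2)."""
    return LOG2_E * (math.log2(1 / eps) + 2 * entropy(p) + 2)


def corollary3_bound(eps, p):
    """(1 - eps)(log 1/eps + H(p) - (eps + eps^2)), without the o(1) term."""
    return (1 - eps) * (math.log2(1 / eps) + entropy(p) - (eps + eps ** 2))


eps, p = 0.01, [0.5, 0.25, 0.125, 0.125]
standard = standard_bits_per_key(eps, p)
fast = fast_bits_per_key(eps, p)
bound = corollary3_bound(eps, p)
ratios = (standard / bound, fast / bound)
```

For this example the Standard variant is the more compact of the two, while the Fast variant trades extra space for fewer expected bit probes on keys outside $X$.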

We note that other choices for the $k_w$ are possible and depending on the application may be
desirable. For example, altering the Fast Bloom map by adding $s \ge 1$ to $k_r$, where $r$ is the root of
$T(p)$, yields a Bloom map that will perform $2 + 1/2^s$ bit probes on average, for $x \in U \setminus X$. Another
possibility is to alter the Standard Bloom map by adding $\log(H(p) + 2)$ to the value of $k_r$, giving
a Bloom map which performs the same expected number of bit probes as the Fast Bloom map on
$x \in U \setminus X$ and the same expected number of bit probes as the Standard Bloom map on $x \in X$.

Proof of Theorem 5. We first show that both Bloom maps are $(\epsilon, \epsilon, 0)$-encoding. We know
already that $f^+(B) \le \epsilon$ so we consider $f_i^*(B)$. We require the following simple lemma.

Lemma 6 Let $T$ be a full binary tree with leaves $v_1, v_2, \ldots, v_b$ at depths $l_1 \le l_2 \le \cdots \le l_b$.
(a) If $1 \le i < j \le b$ then the number of nodes in $P_{v_j} \setminus P_{v_i}$ is at least $\log\left(\sum_{k=i}^{j} 2^{l_j - l_k}\right)$.

(b) If $T_d$ is the set of nodes in $T$ at depth $d$ then $\sum_{d=0}^{l_b} |T_d|/2^d \le \sum_{i=1}^{b} l_i/2^{l_i} + 1$.

(c) The number of left branches on the path $P_{v_i}$ is at most $\log(b - i + 1)$.

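Proof: These are all straightforward, see Appendix for details. □
Lemma 6(a), together with our assumption that $v_k$ is at depth $\log 1/p_k$ in $T(p)$ and the fact that
$p_1 \ge p_2 \ge \cdots \ge p_b$, implies that the number of internal nodes on $P_{i,j}$ is at least $\log\left(\sum_{k=i}^{j} p_k/p_j\right) - 1$.
Let $a$ be the common value of $k_w$ for all internal nodes. By (6) we have, up to an additive $O(1/n)$ term,

$$f_i^*(B) \le \sum_{j=i+1}^b 2^{-t_{i,j}} \le \sum_{j=i+1}^b 2^{a - k_{v_j}} \left(\frac{p_j}{\sum_{k=i}^{j} p_k}\right)^{a} \le \sum_{j=i+1}^b \frac{2^{a - k_{v_j}}}{(j - i + 1)^{a}},$$

where the last inequality follows from the fact that $p_j \le p_k$ for all $i \le k \le j$. In the case of the
Standard Bloom map we have $a = 1$ and $k_{v_j} = \log 1/\epsilon + \log(H_b - 1) + 1$, hence $f_i^*(B) \le \epsilon$. For the
Fast Bloom map $a = 2$, $k_{v_j} = \log 1/\epsilon + 2$ and $\sum_{l=1}^{\infty} 1/l^2 = \pi^2/6$ imply that

$$f_i^*(B) \le \epsilon \sum_{l=2}^{\infty} \frac{1}{l^2} < \epsilon.$$

Hence both Bloom maps are $(\epsilon, \epsilon, 0)$-encoding.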

Now consider how much work we expect to do when querying $B$. We measure this in terms of
the expected number of bit probes performed. (Note that, as described, each bit probe performed
by Findval involves the evaluation of a hash function; this need not be the case. The use of offsets
ensures that we never need to evaluate more than $k = \max_{i \in [b]} \sum_{w \in P_{v_i}} k_w$ base hash functions;
different offsets can then be added as required.) We consider the cases $x \in X$, $x \in U \setminus X$ separately.

Let negbp denote the expected number of bit probes performed by Query$(x, T(p), B)$ for $x \in
U \setminus X$. The easiest case is the Fast Bloom map, in which every internal node $w$ has $k_w = 2$. Let
negbp$(T)$ be the expected number of bit probes performed by Findval in a tree $T$. We wish to find
negbp = negbp$(T(p))$. Starting from the root of $T(p)$ we have

$$\mathrm{negbp}(T(p)) \le 1 + \frac{1}{2} + \frac{1}{4}\left(\mathrm{negbp}(T_L(p)) + \mathrm{negbp}(T_R(p))\right),$$

since if $b_1, b_2$ are the first two bit probes then $\Pr\{b_1 = 0\} = \rho \ge 1/2$ and $\Pr\{b_1 = b_2 = 1\} =
(1 - \rho)^2 \le 1/4$. Iterating and using the fact that all nodes in $T(p)$ have at least two associated bit
probes we find

$$\mathrm{negbp} \le \frac{3}{2} \sum_{j=0}^{\infty} \frac{1}{2^j} = 3.$$

In the Standard Bloom map $k_w = 1$ for every internal node, hence if $w$ is at depth $l_w$ then the
probability that the bit probe associated with $w$ is evaluated during Query$(x, T(p), B)$ is at most
$2^{-l_w}$. Moreover for a leaf $v_i$ at depth $\log 1/p_i$ the probability that Findval performs more than one
bit probe at $v_i$ is at most $p_i/2$ and in this case we expect to perform at most two extra bit probes
at the leaf. Hence if $T_d(p)$ is the set of nodes in $T(p)$ at depth $d$ then the expected number of bit
probes performed during Query$(x, T(p), B)$ is at most

$$\mathrm{negbp} \le 2\sum_{i=1}^b \frac{p_i}{2} + \sum_{d=0}^{l_b} \frac{|T_d(p)|}{2^d} = 1 + \sum_{d=0}^{l_b} \frac{|T_d(p)|}{2^d}.$$

By Lemma 6(b) this is at most $H(p) + 2$.

Finally we calculate the expected number of bit probes performed by Query$(x, T(p), B)$, for
$x \in X_i$, which we denote by posbp$(i)$. This will be the number of bits set during Store$(x, v_i, T(p), B)$,
plus the expected number of bit probes performed by Findval in the "false subtrees" it explores,
where a false subtree is any maximal subtree disjoint from the path $P_{v_i}$. The number of false subtrees
is simply the number of left branches in the path $P_{v_i}$, since at each such branch Findval first explores
the right (false) subtree. By Lemma 6(c) the number of false subtrees is at most $\log(b - i + 1)$.
To simplify our analysis we will assume that the bit probes in false subtrees are independent and
random. By a similar argument to that used during the calculation of the misassignment probability
above this is essentially true.

For the Fast Bloom map we expect to perform at most three bit probes in each false subtree.
Since the number of false subtrees in $T(p)$ is at most $\log(b - i + 1)$ the expected number of bit
probes performed in false subtrees is at most $3\log(b - i + 1)$. Since the number of bits set by
Store$(x, v_i, T(p), B)$ is $2\log 1/p_i + \log 1/\epsilon + 2$ we have

$$\mathrm{posbp}(i) \le 3\log(b - i + 1) + 2\log 1/p_i + \log 1/\epsilon + 2.$$

Now consider the Standard Bloom map. Any false subtree is a full binary tree with $z \le b - i$
leaves and hence corresponds to an optimal binary search tree for some probability distribution
$q = (q_1, q_2, \ldots, q_z)$. Since $H(q) \le \log z \le \log(b - i)$ the expected number of bit probes performed
in any false subtree is at most $\log(b - i) + 2$. The number of bits set by Store$(x, v_i, T(p), B)$ is
$\log 1/p_i + \log 1/\epsilon + \log(H_b - 1) + 1$. Hence

$$\mathrm{posbp}(i) = O((\log b)^2) + \log 1/p_i + \log 1/\epsilon.$$

This completes the proof of Theorem 5. □

Appendix

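Proof of Theorem 2. Let $M$ be a fixed p-map with key set $X$. Suppose that $M$ is $(\epsilon^+, \epsilon^*, \epsilon^-)$-
encoded by the $m$-bit string $s$. For $i \in [b]$ let $a_i^{(s)} = |\{x \in U \mid s(x) = v_i\}|$ and let $w_i^{(s)} = |\{x \in
U \setminus X \mid s(x) = v_i\}|$. Define $q_i^{(s)} \ge 0$ by $w_i^{(s)} = \epsilon^+(u - n)q_i^{(s)}$. So $a_i^{(s)} \le n + \epsilon^+(u - n)q_i^{(s)}$. Since
$f^+(s) \le \epsilon^+$ we have

$$\sum_{i=1}^b q_i^{(s)} \le 1. \qquad (7)$$

We now need to consider how many distinct p-maps $N = \{(y_1, v(y_1)), (y_2, v(y_2)), \ldots, (y_n, v(y_n))\}$
can be $(\epsilon^+, \epsilon^*, \epsilon^-)$-encoded by the string $s$. Let $Y$ be the key set of $N$ and for $i \in [b]$ let $Y_i = \{y \in
Y \mid v(y) = v_i\}$, so $|Y_i| = p_i n$. For $0 \le j \le b$ let $y_{i,j} = |\{y \in Y_i \mid s(y) = v_j\}|$ (where we write $v_0 = \bot$).

Since $f_i^-(s) \le \epsilon^-$, $f_i^*(s) \le \epsilon^*$ and $s$ returns a value from $V \cup \{\bot\}$ for each element in $Y_i$ we
have the following three constraints on the $y_{i,j}$

$$y_{i,0} \le \epsilon^- p_i n, \qquad \sum_{j \in [b] \setminus \{i\}} y_{i,j} \le \epsilon^* p_i n, \qquad \sum_{j=0}^b y_{i,j} = p_i n. \qquad (8)$$

We can now bound the number of choices for the $y_{i,j}$. Since $\sum_{j=0}^b y_{i,j} = p_i n$, $y_{i,i}$ is determined by
fixing the values of $y_{i,j}$ for $j \ne i$. Hence the number of choices for the $y_{i,j}$ is at most

$$\epsilon^- p_i n \sum_{l=0}^{\epsilon^* p_i n} \binom{l + b - 2}{b - 2} \le \epsilon^- \epsilon^* (p_i n)^2 \binom{\epsilon^* p_i n + b - 2}{b - 2}.$$

(This is because (8) implies that there are at most $\epsilon^- p_i n$ choices for $y_{i,0}$ while $\sum_{j \in [b] \setminus \{i\}} y_{i,j} = l$
for some integer $0 \le l \le \epsilon^* p_i n$. The number of ways of choosing $b - 1$ non-negative integers whose
sum is $l$ is $\binom{l + b - 2}{b - 2}$.) For a particular choice of the $y_{i,j}$ the number of choices for the keys in $Y_i$ is at
most

$$\binom{u}{y_{i,0}} \prod_{j=1}^b \binom{n + \epsilon^+(u - n)q_j^{(s)}}{y_{i,j}}. \qquad (9)$$

(This is because any particular choice for the keys in $Y_i$ is given by choosing $y_{i,0}$ keys on which $s$
returns $\bot$ and then choosing $y_{i,j}$ keys on which $s$ returns $v_j$, for each $j \in [b]$.)

Let $y'_{i,0}, y'_{i,1}, \ldots, y'_{i,b}$ be chosen to maximise (9) subject to (8). The number of choices for the
keys in $Y_i$ is at most

$$\epsilon^- \epsilon^* (p_i n)^2 \binom{\epsilon^* p_i n + b - 2}{b - 2} \binom{u}{y'_{i,0}} \prod_{j=1}^b \binom{n + \epsilon^+(u - n)q_j^{(s)}}{y'_{i,j}}.$$

Hence the total number of p-maps which can be $(\epsilon^+, \epsilon^*, \epsilon^-)$-encoded by the string $s$ is at most

$$\prod_{i=1}^b \left[\epsilon^- \epsilon^* (p_i n)^2 \binom{\epsilon^* p_i n + b - 2}{b - 2} \binom{u}{y'_{i,0}} \prod_{j=1}^b \binom{n + \epsilon^+(u - n)q_j^{(s)}}{y'_{i,j}}\right].$$

Letting $q_1, q_2, \ldots, q_b \ge 0$ be chosen to maximise this expression subject to $\sum_{i=1}^b q_i \le 1$ we obtain

$$2^m \prod_{i=1}^b \left[\epsilon^- \epsilon^* (p_i n)^2 \binom{\epsilon^* p_i n + b - 2}{b - 2} \binom{u}{y'_{i,0}} \prod_{j=1}^b \binom{n + \epsilon^+(u - n)q_j}{y'_{i,j}}\right] \ge \binom{u}{n}\binom{n}{p_1 n, p_2 n, \ldots, p_b n}.$$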

Using <{l)<^<a'' we require 



i=l 



6 — 2 \ / piU 



> ( 1 - - 

u 



Taking logarithms and using 



we obtain 



b b b 
i=l j=0 i=l 



m > -51og(e e*) - V]61og(pin) - V'folog ( 1 + ) - y^PiuH i . . . , ^ ) 

^ ^ V ^ Pi^ J ~l \Pi'n Pin Pin J 

+ E E yi. v^^'z. ) - E E (i + ^^^^^^^) + - - S 

1=1 j=i 1=1 j=i ^ yj / 

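Defining $r_{i,j} = y'_{i,j}/p_i n$; noting that the first three terms in the previous inequality are all $O(\log n)$
and using $\log(1 + a) = O(a)$ for $a$ small we obtain

$$m \ge n \sum_{i=1}^b p_i \left(\sum_{j=1}^b r_{i,j} \log\left(r_{i,j}/\epsilon^+ q_j\right) + r_{i,0} \log r_{i,0}\right) + O(\log n) + O\left(\frac{n^2}{u}\right).$$

Dividing by $n$ and using $u \gg n$ we find that the average number of bits required per key is at least

$$\frac{m}{n} \ge \sum_{i=1}^b p_i \left((1 - r_{i,0}) \log 1/\epsilon^+ + r_{i,0} \log r_{i,0} + r_{i,i} \log r_{i,i} + r_{i,i} \log 1/q_i\right)
+ \sum_{i=1}^b \sum_{j \in [b] \setminus \{i\}} p_i r_{i,j} \log r_{i,j} + \sum_{i=1}^b \sum_{j \in [b] \setminus \{i\}} p_i r_{i,j} \log 1/q_j + o(1). \qquad (10)$$

Defining $t_{i,j} = r_{i,j}/(1 - r_{i,0} - r_{i,i})$ we have

$$\sum_{i=1}^b \sum_{j \in [b] \setminus \{i\}} p_i r_{i,j} \log r_{i,j} = \sum_{i=1}^b p_i (1 - r_{i,0} - r_{i,i}) \left(\log(1 - r_{i,0} - r_{i,i}) - H(t_{i,1}, \ldots, t_{i,i-1}, t_{i,i+1}, \ldots, t_{i,b})\right). \qquad (11)$$

Defining $u_j = q_j/(1 - q_i)$ and applying Gibbs' inequality we obtain

$$\sum_{i=1}^b \sum_{j \in [b] \setminus \{i\}} p_i r_{i,j} \log 1/q_j = \sum_{i=1}^b p_i (1 - r_{i,0} - r_{i,i}) \left(\log 1/(1 - q_i) + \sum_{j \in [b] \setminus \{i\}} t_{i,j} \log 1/u_j\right)$$
$$\ge \sum_{i=1}^b p_i (1 - r_{i,0} - r_{i,i}) \left(\log 1/(1 - q_i) + H(t_{i,1}, \ldots, t_{i,i-1}, t_{i,i+1}, \ldots, t_{i,b})\right). \qquad (12)$$

Substituting (11) and (12) into (10) yields

$$\frac{m}{n} \ge \sum_{i=1}^b p_i \left((1 - r_{i,0}) \log 1/\epsilon^+ + r_{i,0} \log r_{i,0} + r_{i,i} \log r_{i,i} + r_{i,i} \log 1/q_i\right)
+ \sum_{i=1}^b p_i (1 - r_{i,0} - r_{i,i}) \left(\log 1/(1 - q_i) + \log(1 - r_{i,0} - r_{i,i})\right) + o(1).$$

Defining $r_i^* = \sum_{j \in [b] \setminus \{i\}} r_{i,j}$ we have (by (8)) that $r_i^* \le \epsilon^*$. We also have $r_{i,0} \le \epsilon^-$ and so
$r_{i,i} = 1 - r_{i,0} - r_i^* \ge 1 - \epsilon^- - \epsilon^*$. Hence

$$\frac{m}{n} \ge (1 - \epsilon^-) \log 1/\epsilon^+ - H(\epsilon^-, \epsilon^*, 1 - \epsilon^- - \epsilon^*) + \sum_{i=1}^b p_i \left(r_{i,i} \log 1/q_i + r_i^* \log 1/(1 - q_i)\right) + o(1)$$
$$\ge (1 - \epsilon^-) \log 1/\epsilon^+ - H(\epsilon^-, \epsilon^*, 1 - \epsilon^- - \epsilon^*) + (1 - \epsilon^- - \epsilon^*) \sum_{i=1}^b p_i \log 1/q_i + \sum_{i=1}^b p_i r_i^* \log 1/(1 - q_i) + o(1). \qquad (13)$$

Finally applying Gibbs' inequality and noting that the last summation in (13) is non-negative yields
our desired lower bound on the average number of bits required per key

$$\frac{m}{n} \ge (1 - \epsilon^-) \log 1/\epsilon^+ + (1 - \epsilon^- - \epsilon^*) H(p) - H(\epsilon^-, \epsilon^*, 1 - \epsilon^- - \epsilon^*) + o(1).$$
□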

Justification that $\rho$, the proportion of zeros in a Simple Bloom map, is sharply concentrated.

If $Y_j$ is the expected number of bits that remain zero in the Simple Bloom map $B$, conditioned
on the first $j$ hashes, then $Y_0 = E[\rho m]$ while $Y_t = \rho m$. The $Y_j$ form a martingale with $|Y_{j+1} - Y_j| \le 1$.
Azuma's inequality now implies that for any $\lambda > 0$ we have

$$\Pr\left\{\rho < E[\rho] - \frac{\lambda\sqrt{t}}{m}\right\} \le e^{-\lambda^2/2}.$$

Hence if $t = O(m)$ then $\rho$ is extremely unlikely to be much smaller than its expected value. (Note
that this argument also implies that the same is true for the more efficient Bloom maps described
in Section 4.)
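
The concentration of $\rho$ is easy to observe empirically. The following sketch (an illustration under simplifying assumptions, not part of the proof) sets $t$ uniformly random bits in an array of size $m$ using a seeded pseudo-random generator and compares the observed proportion of zeros with its expectation $(1 - 1/m)^t$; the function name and parameter values are our own.

```python
import math
import random


def zero_fraction(m, t, seed=0):
    """Set t uniformly random positions in an m-bit array; return the share left at 0."""
    rng = random.Random(seed)
    bits = bytearray(m)
    for _ in range(t):
        bits[rng.randrange(m)] = 1
    return bits.count(0) / m


m = 20000
t = round(m * math.log(2))     # the choice t = m ln 2 used in the text
observed = zero_fraction(m, t)
expected = (1 - 1 / m) ** t    # E[rho], very close to 1/2 for this t
```

With these values the observed proportion should differ from one half only in the third decimal place or so, in line with the $O(1/\sqrt{m})$ deviation suggested by Azuma's inequality.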

Remark on $(\epsilon, \epsilon, 0)$-encoding. Having given lower bounds on the space required by $(\epsilon, \epsilon, 0)$-
encoding data structures in Corollary 3 we would like to claim that the Simple Bloom map $B$ is
$(\epsilon, \epsilon, 0)$-encoding. This is not quite true: the expected proportion of false positives and
misassignments is at most $\epsilon$ but this does not guarantee that $B$ is $(\epsilon, \epsilon, 0)$-encoding. (This is no different
from the often overlooked fact that for an ordinary Bloom filter with false positive probability $\epsilon$
the proportion of keys in $U \setminus X$ for which the filter returns a false positive may be larger than $\epsilon$.)
However the events "$B$ returns a false positive on query $x$", $x \in U \setminus X$, are independent and have
probability at most $f^+(B)$. Hence if $Z$ is the number of false positives in $U \setminus X$ then $Z$ is
stochastically dominated by the binomially distributed variable $\mathrm{Bin}(u - n, f^+(B))$. Using Hoeffding's bound
for the tail of the binomial distribution we have

$$\Pr\left\{Z \ge (u - n) f^+(B) + \lambda\sqrt{u - n}\right\} \le e^{-2\lambda^2}.$$

Hence with high probability the proportion of false positives is at most $f^+(B) + O(1/\sqrt{u - n})$.
Similarly the proportion of misassignments is (with high probability) at most $f_i^*(B) + O(1/\sqrt{n})$.
Thus $B$ is essentially $(\epsilon, \epsilon, 0)$-encoding. (Note that a similar argument implies that this also holds
for the more efficient Bloom maps of Section 4.)

Proof of Lemma 6. First note that if $T$ is a perfect binary tree (i.e. a full binary tree with all
leaves at the same depth) then the number of nodes on $P_{i,j}$ (where $P_{i,j}$ is the part of the path
from the root to $v_j$ that is disjoint from the path to $v_i$) is at least $\log(j - i + 1)$.

Now extend the tree $T$ to a tree $T'$ by replacing each leaf $v_k \in \{v_i, v_{i+1}, \ldots, v_{j-1}\}$ by a perfect
binary tree of depth $l_j - l_k$. By our previous remark the number of nodes on $P_{i,j}$ is at least $\log s$,
where $s$ is the number of leaves lying strictly between $v_{i-1}$ and $v_{j+1}$ in $T'$. Since $s = \sum_{k=i}^{j} 2^{l_j - l_k}$
part (a) now follows.

For (b) note that if we define $l_0 = 0$ then



< i + 



2<i 

d=0 1=1 



' I- 



1=1 



For (c) note that if the path from the root to $v_i$ has left branches at depths $d_1, d_2, \ldots, d_t$ then
the number of leaves to the right of $v_i$ is at least $\sum_{j=1}^{t} 2^{l_i - d_j - 1}$ (this is because $T$ is full). Since
all of the depths of the left branches are distinct and at most $l_i - 1$, the number of leaves to the
right of $v_i$ is at least $\sum_{j=0}^{t-1} 2^j = 2^t - 1$. However the number of leaves to the right of $v_i$ is $b - i$ and
so $t \le \log(b - i + 1)$. □